Skip to content

Conversation

yuandrew
Copy link
Contributor

@yuandrew yuandrew commented Sep 29, 2025

What was changed

NOTE: targeting worker-heartbeat feature branch.

New in memory mechanism to keep track of certain metrics for worker heartbeating.

Why?

Checklist

  1. Closes

  2. How was this tested:

  1. Any docs updates needed?

Note

Introduce in-memory metrics for worker heartbeats, populate/export detailed heartbeat, poller, and slot info, track slot-supplier kind, and add DescribeWorker/SetWorkerDeploymentManager APIs with tests.

  • Telemetry/Metrics:
    • Add in-memory metric tracking (HeartbeatMetricType, WorkerHeartbeatMetrics) and wire into Counter/Gauge/HistogramDuration via *_with_in_memory helpers.
    • Extend MetricsContext::Instruments to record heartbeat-related counters/gauges/histograms; add NoOp attributes support.
  • Worker/Runtime:
    • Populate WorkerHeartbeat with SDK identity, times, poller info (incl. last successful poll), slot usage, sticky cache, plugins, and status.
    • Track last successful poll times in poll buffers/scaler; add worker status and expose worker_instance_key() on API.
    • Add WorkerConfig fields: plugins, skip_client_worker_set_check.
    • Track slot-supplier kind (Fixed/ResourceBased/custom) and include in heartbeat.
    • Improve resource-based tuner sysinfo with background refresh thread.
  • Client/Registry:
    • ClientWorkerSet::register accepts skip_client_worker_set_check; heartbeat callback uses Arc.
    • Client sets heartbeat client-derived fields and computes last-interval slot counters.
  • APIs/Protos/OpenAPI:
    • Add DescribeWorker RPC and wire through client/C-bridge.
    • Add SetWorkerDeploymentManager RPC and related messages.
    • Update multiple proto/OpenAPI structs (namespace capabilities, deployment info, history-reverse doc, priority doc, etc.).
  • Tests:
    • New integration tests for worker heartbeat data/metrics, tuner behavior, multiple workers, and config flags; adjust existing tests for API changes.

Written by Cursor Bugbot for commit fc7f839. This will update automatically on new commits. Configure here.

@yuandrew yuandrew marked this pull request as ready for review October 5, 2025 03:34
@yuandrew yuandrew requested a review from a team as a code owner October 5, 2025 03:34
cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

impl WorkerHeartbeatMetrics {
pub fn get_metric(&self, name: &str) -> Option<HeartbeatMetricType> {
match name {
"sticky_cache_size" => Some(HeartbeatMetricType::Individual(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these names are duplicative of some existing metric name in core/src/telemetry/metrics.rs, which is calling through into this anyway. Rather than having all the fields be pub, I think it'd be safer to make getters for each, individually, that do the same thing this match does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that do the same thing this match does.

Not sure I follow this part, happy to make getters so all fields aren't pub, but it sounds like you're suggesting a change to the match itself? We need some way to map the str name to the struct field. Are you suggesting moving this whole fn over to MetricParameters?

Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some of the tests are failing too

#[builder(default)]
pub plugins: Vec<PluginInfo>,

/// Skips the single worker+client+namespace+task_queue check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GH not letting me suggest this line for whatever crazy reason, but:

"Skips the check that enforces only one worker per client may exist with the same namespace/task queue combination"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, actually I guess it's literally disabling registration entirely. If this is true, even if there are no other conflicting workers, will it end up disabling the worker heartbeating? Definitely shouldn't if it does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes! That's def not what we want. Added a test to verify this behavior

@yuandrew yuandrew force-pushed the worker-heartbeat-telemetry branch from 398b5ce to 33fd78d Compare October 14, 2025 23:24
@yuandrew yuandrew force-pushed the worker-heartbeat-telemetry branch from 33fd78d to f1a3634 Compare October 14, 2025 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants